Discrete Fourier transforms (DFTs) over finite fields have widespread applications in digital communication and storage systems. Hence, reducing the computational complexities of DFTs is of great significance. Recently proposed cyclotomic fast Fourier transforms (CFFTs) are promising due to their low multiplicative complexities. Unfortunately, there are two issues with CFFTs: (1) they rely on efficient short cyclic convolution algorithms, which have not been sufficiently investigated in the literature, and (2) they have very high additive complexities when directly implemented. To address both issues, we make three main contributions in this paper. First, for any odd prime $p$, we reformulate a $p$-point cyclic convolution as a $(p-1) \times (p-1)$ Toeplitz matrix vector product (TMVP), which can be obtained from well-known TMVPs of very small sizes, leading to efficient bilinear algorithms for $p$-point cyclic convolutions. Second, to address the high additive complexities of CFFTs, we propose composite cyclotomic Fourier transforms (CCFTs). In comparison to previously proposed fast Fourier transforms, our CCFTs achieve lower overall complexities for moderate to long lengths, and the improvement significantly increases as the length grows. Third, our efficient algorithms for $p$-point cyclic convolution and CCFTs allow us to obtain longer DFTs over larger fields, e.g., a 2047-point DFT over GF($2^{11}$) and a 4095-point DFT over GF($2^{12}$), which are the first efficient DFTs of such lengths to the best of our knowledge. Finally, our CCFTs are also advantageous for hardware implementations due to their regular and modular structure.
Index Terms

Discrete Fourier transforms, finite fields, cyclotomic fast Fourier transforms, prime-factor algorithm, Cooley-Tukey algorithm

I. Introduction 

Discrete Fourier transforms (DFTs) over finite fields [1] have widespread applications in error correction coding, which in turn is used in all digital communication and storage systems. For instance, both syndrome computation and Chien search in the syndrome based decoder of Reed-Solomon codes ([2], [3]), a family of widely used error control codes, can be formulated as polynomial evaluations and hence can be implemented efficiently using DFTs over finite fields. Implementing an $N$-point DFT directly requires $O(N^2)$ multiplications and $O(N^2)$ additions, and becomes costly when $N$ is large. Hence, reducing the computational complexities of DFTs is of great significance. Recently, efficient long DFTs have become particularly important as increasingly longer error control codes are chosen for digital communication and storage systems. For example, Reed-Solomon codes over GF($2^{12}$) with block lengths of several thousand are considered for hard drives [4] and tape storage [5] as well as optical communication systems [6] to achieve better error performance; the syndrome based decoder of such codes requires DFTs of lengths up to 4095 over GF($2^{12}$). In addition to low complexity, a regular and modular structure of DFTs is desirable for efficient hardware implementations.

In the literature, fast Fourier transforms (FFTs) based on the prime-factor algorithm [7] and the Cooley-Tukey algorithm [8] have been proposed for DFTs over the complex field. When FFTs based on the prime-factor algorithm are adapted to DFTs over finite fields [9], they still have high multiplicative complexities. In contrast, recently proposed cyclotomic FFTs (CFFTs) are promising since they have significantly lower multiplicative complexities [10], [11]. However, CFFTs have two issues. First, they rely on efficient algorithms for short cyclic convolutions, which do not always exist. For instance, CFFTs over GF($2^{11}$) would require efficient algorithms for 11-point cyclic convolutions. Previous works (see, for example, [10]-[12]) have not investigated CFFTs over GF($2^{11}$), partially due to the lack of efficient 11-point cyclic convolutions in the literature. Second, CFFTs have very high additive complexities when directly implemented, which can be reduced by techniques such as common subexpression elimination (CSE) (see, for example, [12]-[15]). In particular, the CSE algorithm in [12] is effective for reducing the additive complexities of CFFTs over GF($2^l$) for $l \le 10$. However, although the CSE algorithm has a polynomial complexity [12, Sec. III-F], its time and memory requirements limit its effectiveness for long DFTs. Due to these two issues, CFFTs over GF($2^{11}$) and GF($2^{12}$) have not been investigated in the literature.

In this paper, we address both aforementioned issues. The main contributions of our paper are as 
follows. 

• For an odd prime $p$, we reformulate a $p$-point cyclic convolution over characteristic-2 finite fields as a product of a $(p-1) \times (p-1)$ Toeplitz matrix and a vector. Since $p-1$ is composite, this product can be readily obtained by multi-dimensional technology from well-known Toeplitz matrix vector products (TMVP) of very small sizes [16]-[20]. In comparison to other ad hoc techniques based on TMVP, our reformulation achieves lower multiplicative complexity, especially for small to moderate $p$. Hence, our reformulation leads to efficient bilinear algorithms for $p$-point cyclic convolution over characteristic-2 finite fields. Our reformulation can be readily extended to the real and complex fields as well as more general finite fields. Furthermore, by multi-dimensional technology, we can also obtain efficient algorithms for $p^n$-point cyclic convolutions. These algorithms are also key to long CFFTs.

• Due to the high additive complexities of CFFTs, we propose composite cyclotomic Fourier transforms (CCFTs), which are a generalization of CFFTs. When the length $N$ of the DFT is factored, that is, $N = N_1 \times N_2$, our CCFTs use $N_1$- and $N_2$-point CFFTs as sub-DFTs via the prime-factor and Cooley-Tukey algorithms. Thus, CFFTs are simply a special case of our CCFTs, corresponding to the trivial factorization, i.e., $N = 1 \times N$. This generalization reduces overall complexities in three ways. First, this divide-and-conquer strategy itself leads to lower complexities. Second, the moderate lengths of the sub-DFTs enable us to apply complexity-reducing techniques such as the CSE algorithm in [12] more effectively. Third, when the length $N$ admits different factorizations, the one with the lowest complexity is selected. In the end, while an $N$-point CCFT may have a higher multiplicative complexity than an $N$-point CFFT, the former achieves a lower overall complexity for long DFTs because of its significantly lower additive complexity. Moreover, when $N$ is composite, an $N$-point CCFT has a regular and modular structure, which is suitable for efficient hardware implementations. Our CCFTs provide a systematic approach to designing long DFTs with low complexity.

• Our efficient algorithms for $p$-point cyclic convolution and CCFTs allow us to obtain longer CFFTs over larger fields. For example, we propose CFFTs over GF($2^{11}$), which have been unavailable in the literature, partially due to the lack of efficient 11-point cyclic convolution algorithms. Our 2047-point DFTs over GF($2^{11}$) and 4095-point DFTs over GF($2^{12}$) are also the first efficient DFTs of such lengths to the best of our knowledge, and they are promising for emerging communication systems.

Our work in this paper extends and improves previous works [10], [12] on CFFTs over finite fields of characteristic two in several ways. First, previously proposed CFFTs focus on $(2^l - 1)$-point CFFTs over GF($2^l$) for $l \le 10$. In contrast, our CCFTs allow us to derive long DFTs with low complexity over larger fields. Our approach can be applied to any finite field, but we present CCFTs over GF($2^{11}$) and GF($2^{12}$) due to their significance in applications. Furthermore, our work investigates $N$-point CFFTs over GF($2^l$) for $N \mid 2^l - 1$. Second, our CCFTs achieve lower overall complexities than all previously proposed FFTs for moderate to long lengths, and the improvement significantly increases as the length grows.

The rest of the paper is organized as follows. Sec. II briefly reviews the necessary background of this paper, such as the CFFT, the prime-factor algorithm, the Cooley-Tukey algorithm, and the CSE algorithm. We propose an efficient bilinear algorithm for $p$-point cyclic convolutions over GF($2^l$) in Sec. III. We then use an 11-point cyclic convolution algorithm to construct a 2047-point CFFT over GF($2^{11}$) in Sec. IV. We also propose our CCFTs and compare their complexities with previously proposed FFTs in Sec. V. The advantages of our CCFTs in hardware implementations are discussed in Sec. VI. Concluding remarks are provided in Sec. VII.

II. Background 

A. Cyclotomic Fast Fourier Transforms 

In this paper, we consider DFTs over finite fields of characteristic two. Let $\alpha \in$ GF($2^l$) be an element with order $N$, which implies that $N \mid 2^l - 1$ (otherwise $\alpha$ does not exist). Given an $N$-dimensional column vector $\mathbf{f} = (f_0, f_1, \cdots, f_{N-1})^T$ over GF($2^l$), the DFT of $\mathbf{f}$ is given by $\mathbf{F} = (F_0, F_1, \cdots, F_{N-1})^T$, where

$$F_j = \sum_{i=0}^{N-1} f_i \alpha^{ij}. \tag{1}$$

If we define $f(x) = \sum_{i=0}^{N-1} f_i x^i$, we have $F_j = f(\alpha^j)$. Directly computing the DFT requires $O(N^2)$ multiplications and $O(N^2)$ additions, and is impractical for large $N$. Cyclotomic FFTs (CFFTs) [10], [11] can reduce the multiplicative complexities greatly.
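As a small illustration of the direct $O(N^2)$ computation in (1), the following Python sketch evaluates a 15-point DFT over GF($2^4$). The field polynomial $x^4 + x + 1$, the integer encoding of field elements, and the helper names `gf_mul`, `gf_pow`, and `dft_direct` are assumptions made for this example only.

```python
from functools import reduce
from operator import xor

M, POLY = 4, 0b10011  # GF(2^4) built from x^4 + x + 1 (an assumed choice)

def gf_mul(a, b):
    """Multiply two elements of GF(2^M), stored as integers (bit k = coefficient of x^k)."""
    r = 0
    while b:
        if b & 1:
            r ^= a              # addition in characteristic 2 is XOR
        b >>= 1
        a <<= 1
        if (a >> M) & 1:
            a ^= POLY           # reduce modulo the field polynomial
    return r

def gf_pow(a, e):
    """Compute a^e by repeated multiplication (adequate for small exponents)."""
    r = 1
    for _ in range(e):
        r = gf_mul(r, a)
    return r

def dft_direct(f, alpha):
    """Evaluate F_j = sum_i f_i * alpha^(i*j) for all j; alpha must satisfy alpha^N = 1."""
    N = len(f)
    return [
        reduce(xor, (gf_mul(f[i], gf_pow(alpha, (i * j) % N)) for i in range(N)), 0)
        for j in range(N)
    ]

# Usage: alpha = 2 (the class of x) has order 15 in this field.
F = dft_direct(list(range(15)), 2)
```

Because $N = 15$ is odd, $N \equiv 1$ in characteristic two, so applying `dft_direct` again with $\alpha^{-1} = \alpha^{14}$ recovers the input.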

We first partition the integer set $\{0, 1, \cdots, N-1\}$ into $m$ cyclotomic cosets modulo $N$ with respect to GF(2) [3]: $C_{s_0}, C_{s_1}, \cdots, C_{s_{m-1}}$, where $C_{s_k} = \{2^0 s_k, 2^1 s_k, \cdots, 2^{m_k-1} s_k\} \pmod N$ and $s_k = 2^{m_k} s_k \pmod N$. A polynomial $L(x) = \sum_i l_i x^{2^i}$, where $l_i \in$ GF($2^l$), is called a linearized polynomial over GF($2^l$), since it has the linear property $L(x + y) = L(x) + L(y)$ for $x, y \in$ GF($2^l$). With the help of cyclotomic cosets, $f(x)$ can be decomposed as a sum of linearized polynomials

$$f(x) = \sum_{k=0}^{m-1} L_k(x^{s_k}), \qquad L_k(x) = \sum_{j=0}^{m_k-1} f_{s_k 2^j \bmod N}\, x^{2^j}.$$

Therefore $F_j = \sum_{k=0}^{m-1} L_k(\alpha^{j s_k})$, and each $\alpha^{j s_k}$ lies in the subfield GF($2^{m_k}$) $\subseteq$ GF($2^l$).
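The coset partition above is easy to compute. The following Python sketch (the function name `cyclotomic_cosets` is chosen for this example) lists the cyclotomic cosets modulo an odd $N$ with respect to GF(2) by repeatedly doubling a representative until it cycles.

```python
def cyclotomic_cosets(N):
    """Return the cyclotomic cosets modulo N with respect to GF(2); N must be odd."""
    seen, cosets = set(), []
    for s in range(N):
        if s in seen:
            continue
        coset, t = [], s
        while t not in coset:      # multiply by 2 until the orbit closes
            coset.append(t)
            t = (2 * t) % N
        seen.update(coset)
        cosets.append(coset)
    return cosets

# Usage: the coset sizes m_k are the cyclic convolution lengths needed by the CFFT.
cosets_15 = cyclotomic_cosets(15)
sizes_2047 = sorted({len(c) for c in cyclotomic_cosets(2047)})
```

For $N = 2047$ the only coset sizes are 1 and 11, which is why CFFTs over GF($2^{11}$) depend on an 11-point cyclic convolution algorithm.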



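Using a normal basis $\{\gamma_k^{2^0}, \gamma_k^{2^1}, \cdots, \gamma_k^{2^{m_k-1}}\}$ in GF($2^{m_k}$), $\alpha^{j s_k}$ can be expressed as $\sum_{i=0}^{m_k-1} a_{i,j,k}\, \gamma_k^{2^i}$, where $a_{i,j,k} \in \{0, 1\}$. By the linear property of the $L_k(x)$'s, $F_j = \sum_{k=0}^{m-1} \sum_{i=0}^{m_k-1} a_{i,j,k} L_k(\gamma_k^{2^i})$. Written in matrix form, the DFT of $\mathbf{f}$ is given by $\mathbf{F} = \mathbf{A}\mathbf{L}\mathbf{\Pi}\mathbf{f}$, where $\mathbf{A}$ is an $N \times N$ binary matrix constructed from the binary coefficients $a_{i,j,k}$, $\mathbf{\Pi}$ is an $N \times N$ permutation matrix, $\mathbf{L} = \mathrm{diag}(\mathbf{L}_0, \mathbf{L}_1, \cdots, \mathbf{L}_{m-1})$ is a block diagonal matrix, and the $\mathbf{L}_k$'s are $m_k \times m_k$ square matrices. The permutation matrix $\mathbf{\Pi}$ reorders the vector $\mathbf{f}$ into $\mathbf{f}' = (\mathbf{f}_0'^T, \mathbf{f}_1'^T, \cdots, \mathbf{f}_{m-1}'^T)^T$, and $\mathbf{f}'_k = (f_{s_k 2^0 \bmod N}, f_{s_k 2^1 \bmod N}, \cdots, f_{s_k 2^{m_k-1} \bmod N})^T$.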

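Though the idea of cyclotomic decomposition dates back to [21], the normal basis representation is a key step [10]. Since $\gamma_k^{2^{m_k}} = \gamma_k$, the $k$-th block $\mathbf{L}_k$ of $\mathbf{L}$ is actually a circulant matrix, which is given by

$$\mathbf{L}_k = \begin{pmatrix} \gamma_k^{2^0} & \gamma_k^{2^1} & \cdots & \gamma_k^{2^{m_k-1}} \\ \gamma_k^{2^1} & \gamma_k^{2^2} & \cdots & \gamma_k^{2^0} \\ \vdots & \vdots & \ddots & \vdots \\ \gamma_k^{2^{m_k-1}} & \gamma_k^{2^0} & \cdots & \gamma_k^{2^{m_k-2}} \end{pmatrix}.$$

Hence the multiplication between $\mathbf{L}_k$ and $\mathbf{f}'_k$ can be formulated as an $m_k$-point cyclic convolution between $\mathbf{b}_k = (\gamma_k^{2^0}, \gamma_k^{2^{m_k-1}}, \gamma_k^{2^{m_k-2}}, \cdots, \gamma_k^{2^1})^T$ and $\mathbf{f}'_k$. Since $m_k$ is usually small, we can use efficient bilinear form algorithms [1] for short cyclic convolutions to compute $\mathbf{L}_k \mathbf{f}'_k$. Those bilinear form algorithms have the following form,

$$\mathbf{L}_k \mathbf{f}'_k = \mathbf{b}_k \otimes \mathbf{f}'_k = \mathbf{Q}_k(\mathbf{R}_k \mathbf{b}_k \cdot \mathbf{P}_k \mathbf{f}'_k) = \mathbf{Q}_k(\mathbf{c}_k \cdot \mathbf{P}_k \mathbf{f}'_k),$$

where $\mathbf{P}_k$, $\mathbf{Q}_k$, and $\mathbf{R}_k$ are all binary matrices, $\mathbf{c}_k = \mathbf{R}_k \mathbf{b}_k$ is a precomputed constant vector, and $\cdot$ denotes a component-wise multiplication between two vectors. Combining all the matrices, we get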



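$$\mathbf{F} = \mathbf{A}\mathbf{Q}(\mathbf{c} \cdot \mathbf{P}\mathbf{f}'), \tag{2}$$

where $\mathbf{Q} = \mathrm{diag}(\mathbf{Q}_0, \mathbf{Q}_1, \cdots, \mathbf{Q}_{m-1})$, $\mathbf{P} = \mathrm{diag}(\mathbf{P}_0, \mathbf{P}_1, \cdots, \mathbf{P}_{m-1})$, and $\mathbf{c} = (\mathbf{c}_0^T, \mathbf{c}_1^T, \cdots, \mathbf{c}_{m-1}^T)^T$.

The multiplications required by (2) are due to the component-wise multiplication between $\mathbf{c}$ and $\mathbf{P}\mathbf{f}'$, and the additions required by (2) are for multiplications between binary matrices and vectors. Direct implementation of the CFFT in (2) requires far fewer multiplications than the direct implementation of the DFT, at the expense of a very high additive complexity.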
B. Common Subexpression Elimination

Consider an $N \times M$ binary matrix $\mathbf{M}$ and an $M$-dimensional vector $\mathbf{x}$ over a field $\mathbb{F}$. The matrix vector multiplication $\mathbf{M}\mathbf{x}$ can be done by additions over $\mathbb{F}$ only, the number of which is denoted by $C(\mathbf{M})$ since the complexity is determined by $\mathbf{M}$ when $\mathbf{x}$ is arbitrary. The problem of determining the minimal number of additions, denoted by $C_{\mathrm{opt}}(\mathbf{M})$, has been shown to be NP-complete [22]. Instead, different common subexpression elimination algorithms (see, e.g., [13]-[15]) have been proposed to reduce $C(\mathbf{M})$. The CSE algorithm proposed in [12] takes advantage of the differential savings and recursive savings, and can greatly reduce the number of additions in calculating $\mathbf{M}\mathbf{x}$, although the reduced additive complexity, denoted by $C_{\mathrm{CSE}}(\mathbf{M})$, is not guaranteed to be the minimum. Like other CSE algorithms, the CSE algorithm in [12] is randomized, and the reduction results of different runs are not necessarily the same. Therefore, in practice, a better result can be obtained by running the CSE algorithm many times and then selecting the smallest number of additions. The CSE algorithm in [12] greatly reduces the additive and overall complexities of CFFTs with lengths up to 1023, but it is much more difficult to reduce the additive complexities of longer CFFTs. This is because, although the CSE algorithm in [12] has a polynomial complexity (it is shown that its complexity is $O(N^4 + N^3 M^3)$), the runtime and memory requirements become prohibitive when $M$ and $N$ are very large, which occurs for long CFFTs.
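To make the notion of additive complexity concrete, the following Python sketch counts the additions of a direct evaluation of $\mathbf{M}\mathbf{x}$ and then applies a simple greedy pairwise elimination. This greedy heuristic is a stand-in chosen for illustration; it is not the CSE algorithm of [12], which additionally exploits differential and recursive savings and is randomized. The function names and the row representation (sets of column indices) are assumptions of this example.

```python
from collections import Counter
from itertools import combinations

def naive_additions(rows):
    """Additions for a direct evaluation: a row with w ones costs w - 1 additions."""
    return sum(max(len(r) - 1, 0) for r in rows)

def greedy_cse_additions(rows, n_inputs):
    """Greedy pairwise common subexpression elimination (illustrative heuristic only)."""
    rows = [set(r) for r in rows]
    next_var, extra = n_inputs, 0
    while True:
        pairs = Counter()
        for r in rows:
            pairs.update(combinations(sorted(r), 2))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:              # no pair is shared by two rows any more
            break
        for r in rows:
            if a in r and b in r:
                r -= {a, b}        # replace x_a + x_b by a new intermediate variable
                r.add(next_var)
        next_var += 1
        extra += 1                 # one addition to form the shared subexpression
    return extra + naive_additions(rows)

# Usage: rows of a 3 x 4 binary matrix, each given by the positions of its ones.
rows = [[0, 1, 2], [0, 1, 3], [0, 1, 2, 3]]
before, after = naive_additions(rows), greedy_cse_additions(rows, 4)
```

Even this simple heuristic recounts all pairs in every round, which hints at why runtime and memory grow quickly with the matrix size.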

C. Prime-Factor and Cooley-Tukey Algorithms

Both the prime-factor algorithm and the Cooley-Tukey algorithm first decompose an $N$-point DFT into shorter sub-DFTs, and then construct the $N$-point DFT from the sub-DFTs [1]. The prime-factor algorithm requires that the length $N$ has at least two co-prime factors, i.e., there exist two co-prime numbers $N_1$ and $N_2$ such that $N = N_1 N_2$. For an integer $i \in \{0, 1, \cdots, N-1\}$, there is a unique integer pair $(i_1, i_2)$ such that $0 \le i_1 \le N_1 - 1$, $0 \le i_2 \le N_2 - 1$, and $i = i_1 N_2 + i_2 N_1 \pmod N$, since $N_1$ and $N_2$ are co-prime. For any integer $j \in \{0, 1, \cdots, N-1\}$, let $j_1 = j \pmod{N_1}$ and $j_2 = j \pmod{N_2}$, where $0 \le j_1 \le N_1 - 1$ and $0 \le j_2 \le N_2 - 1$. By the Chinese remainder theorem, $(j_1, j_2)$ uniquely determines $j$, and $j$ can be represented by $j = j_1 N_2^{-1} N_2 + j_2 N_1^{-1} N_1 \pmod N$, where $N_2^{-1} N_2 \equiv 1 \pmod{N_1}$ and $N_1^{-1} N_1 \equiv 1 \pmod{N_2}$. Substituting the above representations of $i$ and $j$ in (1), we get $\alpha^{ij} = (\alpha^{N_2})^{i_1 j_1} (\alpha^{N_1})^{i_2 j_2}$, where $\alpha^{N_2}$ and $\alpha^{N_1}$ are an $N_1$-th root and an $N_2$-th root of 1, respectively. Therefore, (1) becomes

$$F_j = \underbrace{\sum_{i_1=0}^{N_1-1} \Big( \overbrace{\sum_{i_2=0}^{N_2-1} f_{i_1 N_2 + i_2 N_1}\, \alpha^{N_1 i_2 j_2}}^{N_2\text{-point DFT}} \Big) \alpha^{N_2 i_1 j_1}}_{N_1\text{-point DFT}}. \tag{3}$$

In this way, the $N$-point DFT is obtained by using $N_1$- and $N_2$-point sub-DFTs. The $N$-point DFT result is derived by first carrying out $N_1$ $N_2$-point DFTs and $N_2$ $N_1$-point DFTs, and then combining the results according to the representation of $j$. The prime-factor algorithm can also be applied to the $N_1$- and $N_2$-point DFTs if they have co-prime factors.
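The index mappings of the prime-factor algorithm can be checked numerically. The Python sketch below decomposes a 15-point DFT over GF($2^4$) into $3$- and $5$-point sub-DFTs following (3). The field polynomial $x^4 + x + 1$ and all function names are assumptions of this example, and the sub-DFTs are computed directly rather than by CFFTs to keep the sketch short.

```python
from functools import reduce
from operator import xor

M, POLY = 4, 0b10011  # GF(2^4) built from x^4 + x + 1 (an assumed choice)

def gf_mul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> M) & 1:
            a ^= POLY
    return r

def gf_pow(a, e):
    r = 1
    for _ in range(e):
        r = gf_mul(r, a)
    return r

def dft_direct(f, alpha):
    """Direct DFT of length len(f); alpha must satisfy alpha^len(f) = 1."""
    N = len(f)
    return [reduce(xor, (gf_mul(f[i], gf_pow(alpha, (i * j) % N)) for i in range(N)), 0)
            for j in range(N)]

def dft_prime_factor(f, alpha, N1, N2):
    """N-point DFT from N1- and N2-point sub-DFTs; requires gcd(N1, N2) = 1."""
    N = N1 * N2
    a1, a2 = gf_pow(alpha, N2), gf_pow(alpha, N1)   # N1-th and N2-th roots of 1
    # Inner stage: N1 DFTs of length N2, input index i = i1*N2 + i2*N1 (mod N).
    inner = [dft_direct([f[(i1 * N2 + i2 * N1) % N] for i2 in range(N2)], a2)
             for i1 in range(N1)]
    e1 = pow(N2, -1, N1) * N2    # congruent to 1 mod N1 and 0 mod N2
    e2 = pow(N1, -1, N2) * N1    # congruent to 0 mod N1 and 1 mod N2
    F = [0] * N
    # Outer stage: N2 DFTs of length N1, output index j rebuilt from (j1, j2) by the CRT.
    for j2 in range(N2):
        outer = dft_direct([inner[i1][j2] for i1 in range(N1)], a1)
        for j1 in range(N1):
            F[(j1 * e1 + j2 * e2) % N] = outer[j1]
    return F

f = list(range(15))
same = dft_prime_factor(f, 2, 3, 5) == dft_direct(f, 2)
```

No multiplications occur between the two stages, which is the property that distinguishes (3) from the Cooley-Tukey decomposition. The modular inverse `pow(x, -1, n)` requires Python 3.8 or later.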

The Cooley-Tukey algorithm has a different decomposition strategy from the prime-factor algorithm. Let $N = N_1 N_2$, where $N_1$ and $N_2$ do not have to be co-prime. Let $i = i_1 + i_2 N_1$, where $0 \le i_1 \le N_1 - 1$ and $0 \le i_2 \le N_2 - 1$, and $j = j_1 N_2 + j_2$, where $0 \le j_1 \le N_1 - 1$ and $0 \le j_2 \le N_2 - 1$. Then (1) becomes

$$F_j = \underbrace{\sum_{i_1=0}^{N_1-1} \Big( \overbrace{\sum_{i_2=0}^{N_2-1} f_{i_1 + i_2 N_1}\, \alpha^{N_1 i_2 j_2}}^{N_2\text{-point DFT}} \Big) \alpha^{i_1 j_2}\, \alpha^{N_2 i_1 j_1}}_{N_1\text{-point DFT}}. \tag{4}$$

In this way, the Cooley-Tukey algorithm also decomposes the $N$-point DFT into $N_1$- and $N_2$-point DFTs. However, compared with (3), (4) has an extra term $\alpha^{i_1 j_2}$, which is called the twiddle factor and incurs additional multiplicative complexity. The Cooley-Tukey algorithm can be used for arbitrary non-prime length $N$, including prime powers, to which the prime-factor algorithm cannot be applied. The Cooley-Tukey algorithm is very suitable if $N$ has many small factors; for example, a $2^n$-point DFT by the Cooley-Tukey algorithm requires $O(n \cdot 2^n)$ multiplications.
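The role of the twiddle factors in (4) can be seen in a short numerical check. The Python sketch below computes a 9-point DFT over GF($2^6$) as $3 \times 3$, a case where the two factors are not co-prime. The field polynomial $x^6 + x + 1$, the choice $\alpha = x^7$, and all function names are assumptions of this example.

```python
from functools import reduce
from operator import xor

M, POLY = 6, 0b1000011  # GF(2^6) built from x^6 + x + 1 (an assumed choice)

def gf_mul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> M) & 1:
            a ^= POLY
    return r

def gf_pow(a, e):
    r = 1
    for _ in range(e):
        r = gf_mul(r, a)
    return r

def dft_direct(f, alpha):
    """Direct DFT of length len(f); alpha must satisfy alpha^len(f) = 1."""
    N = len(f)
    return [reduce(xor, (gf_mul(f[i], gf_pow(alpha, (i * j) % N)) for i in range(N)), 0)
            for j in range(N)]

def dft_cooley_tukey(f, alpha, N1, N2):
    """N-point DFT with i = i1 + i2*N1 and j = j1*N2 + j2, as in (4)."""
    N = N1 * N2
    a1, a2 = gf_pow(alpha, N2), gf_pow(alpha, N1)   # N1-th and N2-th roots of 1
    inner = [dft_direct([f[i1 + i2 * N1] for i2 in range(N2)], a2) for i1 in range(N1)]
    F = [0] * N
    for j2 in range(N2):
        # Twiddle factors alpha^(i1*j2): the extra multiplications absent from (3).
        twiddled = [gf_mul(inner[i1][j2], gf_pow(alpha, i1 * j2)) for i1 in range(N1)]
        outer = dft_direct(twiddled, a1)
        for j1 in range(N1):
            F[j1 * N2 + j2] = outer[j1]
    return F

alpha = gf_pow(2, 7)                      # an element whose 9th power is 1
f = [(5 * k + 1) % 64 for k in range(9)]
same = dft_cooley_tukey(f, alpha, 3, 3) == dft_direct(f, alpha)
```

Only the twiddle factors with $i_1 j_2 \neq 0$ cost a multiplication; here that is $(N_1 - 1)(N_2 - 1) = 4$ of the 9 products.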

III. $p$-point Cyclic Convolutions over GF($2^m$)

Efficient short cyclic convolution algorithms play an essential role in the multiplicative complexity reduction of CFFTs, since they determine the multiplicative complexities of CFFTs. Note that the lengths of the cyclic convolutions involved in CFFTs are the same as the sizes of the conjugate classes. Since the sizes of all possible conjugate classes in GF($2^m$) are divisors of $m$, efficient algorithms are needed only for short cyclic convolutions.

Despite their significance, there are no general algorithms for efficient cyclic convolutions of arbitrary length over finite fields. Of course, efficient ad hoc algorithms for 2- to 9-point cyclic convolutions can be found in the literature (4- and 8-point algorithms can be found in [23]-[25], and their details are included in Appendix B due to their limited access, and the rest can be found in [1] and [3]). Furthermore, cyclic convolutions with composite length can be constructed with the multi-dimensional technology described in [1]. For instance, 10-point cyclic convolution algorithms can be constructed based on 2- and 5-point algorithms, while a 12-point cyclic convolution algorithm is constructed based on 3- and 4-point algorithms. However, an efficient algorithm for cyclic convolutions of larger prime length (for example, 11- or 13-point) is not available in the open literature. We can implement these cyclic convolutions via the convolution theorem. Although the DFTs and IDFT can be implemented by the Winograd algorithm [26] or the Rader algorithm [27], this approach remains inefficient, especially for small to moderate lengths. In [28], strategies to derive cyclic convolution algorithms directly over any finite field GF($q^m$) were developed. Unfortunately, these methods are applicable only to lengths $q^m - 1$ or their factors.

Herein, for an odd prime $p$, we reformulate a $p$-point cyclic convolution as a product of a $(p-1) \times (p-1)$ Toeplitz matrix and a vector. Since $p-1$ is composite, this product can be readily obtained by multi-dimensional technology from well-known TMVPs of very small sizes, leading to efficient bilinear algorithms for $p$-point cyclic convolutions. Since these cyclic convolutions will be used for CFFTs over GF($2^l$), we focus on cyclic convolutions over GF($2^l$). However, our reformulation can be readily extended to the real and complex fields as well as more general finite fields. Furthermore, by multi-dimensional technology, we can also obtain efficient algorithms for $p^n$-point cyclic convolutions. These algorithms are also key to long CFFTs.

For a $p$-dimensional vector $\mathbf{x} = (x_0, x_1, \cdots, x_{p-1})^T$ over some field GF($2^l$), where $p$ is any odd prime, we consider its corresponding polynomial $X(w) = \sum_{i=0}^{p-1} x_i w^i$. Assuming that the $p$-point cyclic convolution of two vectors $\mathbf{x}$ and $\mathbf{y}$ is $\mathbf{z}$, all of which are $p$-dimensional vectors over GF($2^l$), their corresponding polynomials are related by [1]

$$Z(w) = X(w)Y(w) \pmod{w^p + 1}. \tag{5}$$

Note that $w^p + 1 = (w + 1)(w^{p-1} + w^{p-2} + \cdots + 1)$, and $w + 1$ and $w^{p-1} + w^{p-2} + \cdots + 1$ are co-prime over GF($2^l$). Hence, by the Chinese remainder theorem, $Z(w)$ can be uniquely determined by $Z_0^*$ and $Z'(w) = \sum_{i=0}^{p-2} Z'_i w^i$, where

$$Z_0^* = Z(w) \pmod{w + 1}, \qquad Z'(w) = Z(w) \pmod{w^{p-1} + w^{p-2} + \cdots + 1}. \tag{6}$$



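It is easy to see that $Z_0^* = \sum_{i=0}^{p-1} z_i$ and $Z'_i = z_i + z_{p-1}$, and the vector $\mathbf{Z}^* = (Z_0^*, Z'_0, Z'_1, \cdots, Z'_{p-2})^T$ can be derived by multiplying the vector $\mathbf{z}$ with a $p \times p$ matrix $\mathbf{B}$ with the structure

$$\mathbf{B} = \begin{pmatrix} 1 \;\; 1 \;\; \cdots \;\; 1 & 1 \\ \mathbf{I}_{p-1} & \mathbf{1} \end{pmatrix},$$

where $\mathbf{I}_{p-1}$ is a $(p-1) \times (p-1)$ identity matrix and $\mathbf{1}$ is the all-ones column vector of length $p-1$. That is, $\mathbf{Z}^* = \mathbf{B}\mathbf{z}$.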

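To compute the $p$-point cyclic convolution of $\mathbf{x}$ and $\mathbf{y}$, we first compute $\mathbf{X}^* = \mathbf{B}\mathbf{x}$ and $\mathbf{Y}^* = \mathbf{B}\mathbf{y}$, then compute $\mathbf{Z}^*$ from $\mathbf{X}^*$ and $\mathbf{Y}^*$, and finally, $\mathbf{z} = \mathbf{B}^{-1}\mathbf{Z}^*$. With the same partitioning scheme and equations (5) and (6), it is easy to see that $Z_0^* = X_0^* Y_0^*$, and

$$Z'(w) = X'(w)Y'(w) \pmod{w^{p-1} + w^{p-2} + \cdots + 1}, \tag{7}$$

and hence we can compute $\mathbf{Z}^* = (Z_0^*, \mathbf{Z}'^T)^T$.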

From (7), the polynomial product can be computed as

$$X'(w)Y'(w) = \sum_{k=0}^{p-2} \sum_{j=0}^{p-2} \left( Y'_{k-j} + Y'_{k-j+p} + Y'_{p-1-j} \right) X'_j\, w^k \pmod{w^{p-1} + w^{p-2} + \cdots + 1}, \tag{8}$$

and hence the vector $\mathbf{Z}'$ can be computed through a matrix product $\mathbf{Z}' = \mathbf{M}\mathbf{X}'$, where the elements of the matrix $\mathbf{M}$ are

$$M_{k,j} = Y'_{k-j} + Y'_{k-j+p} + Y'_{p-1-j}. \tag{9}$$

Note that in (8) and (9), $Y'_i$ is considered as zero outside its valid range, i.e., $Y'_i = 0$ if $i < 0$ or $i > p-2$.

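We can check that $\mathbf{B}$ is an invertible matrix, and $\mathbf{B}^{-1}$ is given by

$$\mathbf{B}^{-1} = \begin{pmatrix} 1 & \mathbf{A}_1 \\ \mathbf{A}_2 & \mathbf{A}_3 \end{pmatrix},$$

where the length-$(p-1)$ row vector $\mathbf{A}_1 = (0, 1, 1, \cdots, 1)$, the length-$(p-1)$ column vector $\mathbf{A}_2 = (1, 1, \cdots, 1)^T$, and the $(p-1) \times (p-1)$ matrix $\mathbf{A}_3$ has 0 on the first upper diagonal and 1 everywhere else. Now consider the product of $\mathbf{B}^{-1}$ and a length-$p$ column vector $\mathbf{U}$:

$$\mathbf{B}^{-1} \begin{pmatrix} U_0 \\ \mathbf{U}' \end{pmatrix} = \begin{pmatrix} 1 & \mathbf{A}_1 \\ \mathbf{A}_2 & \mathbf{A}_3 \end{pmatrix} \begin{pmatrix} U_0 \\ \mathbf{U}' \end{pmatrix} = \begin{pmatrix} V_0 \\ \mathbf{V}' \end{pmatrix},$$

where $U_0$, $\mathbf{U}'$ and $V_0$, $\mathbf{V}'$ are appropriate partitions of the vector $\mathbf{U}$ and the multiplication result vector $\mathbf{V}$, respectively. The values of $V_0$ and $\mathbf{V}'$ can be computed as $V_0 = U_0 + \mathbf{A}_1\mathbf{U}'$ and $\mathbf{V}' = \mathbf{A}_2 U_0 + \mathbf{A}_3\mathbf{U}'$.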
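Note that $\mathbf{A}_1$ and $\mathbf{A}_3$ are related as $\mathbf{A}_1 = (1, 1, \cdots, 1)\,\mathbf{A}_3$. This implies that the sum of the components of $\mathbf{A}_3\mathbf{U}'$ gives $\mathbf{A}_1\mathbf{U}'$. Furthermore, $\mathbf{A}_2$ contains only 1's. Thus the computation of $V_0$ and $\mathbf{V}'$ reduces to

$$V_0 = U_0 + \Sigma(\mathbf{A}_3\mathbf{U}'), \qquad \mathbf{V}' = [U_0, U_0, \cdots, U_0]^T + \mathbf{A}_3\mathbf{U}', \tag{10}$$

where $\Sigma(\cdot)$ denotes the sum of the components of a vector. Eq. (10) shows that multiplying a vector with $\mathbf{B}^{-1}$ needs only an evaluation of $\mathbf{A}_3\mathbf{U}'$.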

The cyclic convolution result $\mathbf{z}$ is obtained by first multiplying $\mathbf{A}_3$ and $\mathbf{Z}'$. Thus one needs to compute $\mathbf{R}\mathbf{X}'$, where the $(p-1) \times (p-1)$ matrix $\mathbf{R} = \mathbf{A}_3\mathbf{M}$. We now show by direct computation that $\mathbf{R}$ is a Toeplitz matrix. From the structure of $\mathbf{A}_3$, we have

$$R_{i,j} = M_{i+1,j} + \sum_{k=0}^{p-2} M_{k,j}. \tag{11}$$

From (9), using appropriate ranges for the three terms, we get

$$\sum_{k=0}^{p-2} M_{k,j} = Y'_{p-1-j} + \sum_{s=0}^{p-2} Y'_s. \tag{12}$$

Finally, combining (9), (11), and (12) gives

$$R_{i,j} = Y'_{i-j+1} + Y'_{i-j+p+1} + \sum_{s=0}^{p-2} Y'_s. \tag{13}$$

Since $R_{i,j}$ is a function of only $i - j$, $\mathbf{R}$ is a Toeplitz matrix. Recall that $Y'_i$ is assumed zero if its index is outside the valid range from 0 to $p-2$. Thus in (13), at most one of the first two terms is valid for any combination of $i$ and $j$.
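The reformulation can be verified numerically. The Python sketch below builds the Toeplitz matrix $\mathbf{R}$ from (13), evaluates $\mathbf{R}\mathbf{X}'$ directly (without any fast TMVP algorithm), recovers $\mathbf{z}$ through (10), and compares the result with a direct cyclic convolution. Working over GF($2^4$) with the polynomial $x^4 + x + 1$ and the function names used here are assumptions of this example; the identity itself holds over any field of characteristic two.

```python
from functools import reduce
from operator import xor

M, POLY = 4, 0b10011  # GF(2^4) built from x^4 + x + 1 (an assumed choice)

def gf_mul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> M) & 1:
            a ^= POLY
    return r

def cyclic_conv(x, y):
    """Direct p-point cyclic convolution, p^2 multiplications."""
    p = len(x)
    z = [0] * p
    for i in range(p):
        for j in range(p):
            z[(i + j) % p] ^= gf_mul(x[i], y[j])
    return z

def toeplitz_R(y):
    """Matrix R of (13): R[i][j] = Y'_{i-j+1} + Y'_{i-j+p+1} + sum_s Y'_s."""
    p = len(y)
    Yp = [y[i] ^ y[p - 1] for i in range(p - 1)]            # Y'_i = y_i + y_{p-1}
    S = reduce(xor, Yp, 0)
    Y = lambda i: Yp[i] if 0 <= i <= p - 2 else 0           # zero outside the valid range
    return [[Y(i - j + 1) ^ Y(i - j + p + 1) ^ S for j in range(p - 1)]
            for i in range(p - 1)]

def cyclic_conv_tmvp(x, y):
    """p-point cyclic convolution via one (p-1) x (p-1) TMVP plus one extra product."""
    p = len(x)
    Xp = [x[i] ^ x[p - 1] for i in range(p - 1)]            # X'_i = x_i + x_{p-1}
    u0 = gf_mul(reduce(xor, x, 0), reduce(xor, y, 0))       # Z*_0 = X*_0 Y*_0
    R = toeplitz_R(y)
    t = [reduce(xor, (gf_mul(R[i][j], Xp[j]) for j in range(p - 1)), 0)
         for i in range(p - 1)]                             # t = R X'
    return [u0 ^ reduce(xor, t, 0)] + [u0 ^ ti for ti in t]  # eq. (10)

x = [(3 * i + 1) % 16 for i in range(11)]
y = [(5 * i + 2) % 16 for i in range(11)]
same = cyclic_conv_tmvp(x, y) == cyclic_conv(x, y)
```

In an efficient implementation the product `t = R X'` would be replaced by a fast TMVP algorithm, which is where the multiplicative savings come from.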

Fig. 1 illustrates our algorithm for $p$-point cyclic convolutions, which relies on the implementation of $\mathbf{R}\mathbf{X}'$. Direct implementation of $\mathbf{R}\mathbf{X}'$ requires $(p-1)^2$ multiplications, but we can reduce this number since $\mathbf{R}$ is a Toeplitz matrix. For any odd prime $p > 3$, $p-1$ is composite and $\mathbf{R}\mathbf{X}'$ can be obtained by using multi-dimensional technology from TMVPs of smaller sizes [16]-[20]. For example, CFFTs over GF($2^{11}$), GF($2^{13}$), GF($2^{17}$), and GF($2^{19}$) involve 11-, 13-, 17-, and 19-point cyclic convolutions, respectively. Using our reformulation, these cyclic convolutions can be obtained from TMVPs of sizes $2 \times 2$, $3 \times 3$, and $5 \times 5$, which are provided in Appendix A. Hence our reformulation leads to efficient cyclic convolution algorithms for odd primes $p \le 19$, which are sufficient for all CFFTs over characteristic-2 fields as large as GF($2^{19}$).

Fig. 1. $p$-point cyclic convolution.

This reformulation is also applicable to a prime greater than 19, where $p-1$ may have a prime factor $p'$ greater than five. In this case, one can use two ad hoc techniques to proceed. First, one can break a $p' \times p'$ matrix into blocks, and treat them separately. Second, one can extend the $p' \times p'$ matrix to a larger matrix so that it remains a Toeplitz matrix and its size becomes composite again. The complexities of cyclic convolution algorithms obtained through this reformulation are much smaller than that of direct implementation. For example, we can first extend the $(p-1) \times (p-1)$ Toeplitz matrix to a $2^{\lceil \log_2(p-1) \rceil} \times 2^{\lceil \log_2(p-1) \rceil}$ matrix, and it requires fewer than $3^{\lceil \log_2(p-1) \rceil}$ multiplications if we use the two-way split method described in
[1].
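For illustration, a two-way split of a TMVP over a field of characteristic two can be sketched recursively as follows. Writing the $n \times n$ Toeplitz matrix as $\begin{pmatrix} \mathbf{T}_1 & \mathbf{T}_0 \\ \mathbf{T}_2 & \mathbf{T}_1 \end{pmatrix}$ and the vector as $(\mathbf{V}_0, \mathbf{V}_1)$, three half-size products suffice. This is the standard three-product identity and is not necessarily the exact formulation of the cited reference; the storage convention `t[i - j + n - 1]`, the field GF($2^4$), and the function names are assumptions of this example.

```python
from functools import reduce
from operator import xor

M, POLY = 4, 0b10011  # GF(2^4) built from x^4 + x + 1 (an assumed choice)

def gf_mul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> M) & 1:
            a ^= POLY
    return r

def vec_xor(a, b):
    return [p ^ q for p, q in zip(a, b)]

def tmvp_two_way(t, v, count):
    """Toeplitz matrix vector product for n = 2^k, with T[i][j] = t[i - j + n - 1].

    count is a one-element list used to tally field multiplications.
    """
    n = len(v)
    if n == 1:
        count[0] += 1
        return [gf_mul(t[0], v[0])]
    h = n // 2
    # Defining entries of the upper-right (T0), diagonal (T1), and lower-left (T2) blocks.
    t0, t1, t2 = t[0:2 * h - 1], t[h:3 * h - 1], t[2 * h:4 * h - 1]
    v0, v1 = v[:h], v[h:]
    p0 = tmvp_two_way(vec_xor(t0, t1), v1, count)   # (T0 + T1) V1
    p1 = tmvp_two_way(t1, vec_xor(v0, v1), count)   # T1 (V0 + V1)
    p2 = tmvp_two_way(vec_xor(t1, t2), v0, count)   # (T1 + T2) V0
    return vec_xor(p0, p1) + vec_xor(p1, p2)

def tmvp_direct(t, v):
    n = len(v)
    return [reduce(xor, (gf_mul(t[i - j + n - 1], v[j]) for j in range(n)), 0)
            for i in range(n)]

n = 8
t = [(7 * k + 3) % 16 for k in range(2 * n - 1)]
v = [(5 * k + 1) % 16 for k in range(n)]
count = [0]
same = tmvp_two_way(t, v, count) == tmvp_direct(t, v)
```

Each level of recursion triples the number of products, so an unpadded $2^k \times 2^k$ TMVP uses $3^k$ multiplications; zero padding introduced by the extension can remove some of them, which is consistent with the bound being strict.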

We note that a $p$-point cyclic convolution can be formulated as a $p \times p$ circulant matrix vector product. Since a circulant matrix is a special case of a Toeplitz matrix, one can of course apply the two ad hoc techniques described above to this $p \times p$ Toeplitz matrix directly. However, since our reformulation turns a $p$-point cyclic convolution into a $(p-1) \times (p-1)$ TMVP, which directly benefits from multi-dimensional technologies, at the expense of only one extra multiplication, we believe our reformulation will lead to lower multiplicative complexity. We cannot prove this analytically, but will illustrate this point below with an example.

We also remark that our reformulation leads to bilinear algorithms for cyclic convolutions, which can 
be implemented efficiently since the pre- and post-addition matrices are all binary. 

A. Example: 11-point Convolution Algorithm over GF($2^m$)

To illustrate the advantages of our reformulation above, we derive our efficient 11-point cyclic convolution algorithm over GF($2^{11}$) and compare its multiplicative complexity with some other approaches. By using well-known $2 \times 2$ and $5 \times 5$ TMVPs, we obtain an 11-point cyclic convolution algorithm $\mathbf{z} = \mathbf{Q}^{(11)}(\mathbf{R}^{(11)}\mathbf{y} \cdot \mathbf{P}^{(11)}\mathbf{x})$, where the matrices $\mathbf{Q}^{(11)}$, $\mathbf{P}^{(11)}$, and $\mathbf{R}^{(11)}$ are given in the appendix. Since the $10 \times 10$ TMVP requires 42 multiplications, our 11-point cyclic convolution requires 43 multiplications.

Let us compare this multiplicative complexity with the two ad hoc techniques. First, we can partition the $11 \times 11$ circulant matrix into a $10 \times 10$ Toeplitz matrix, a $10 \times 1$ column vector, a $1 \times 10$ row vector, and a single element, and then apply the multi-dimensional technology to the $10 \times 10$ TMVP. In addition to the $10 \times 10$ TMVP, this approach requires 21 extra multiplications, as opposed to one in our approach. Second, we can extend the $11 \times 11$ circulant matrix to a $12 \times 12$ Toeplitz matrix, and then apply the multi-dimensional technology to this matrix. A $12 \times 12$ TMVP requires $54 = 3 \times 3 \times 6$ multiplications. Taking into account that we pad a zero to the $11 \times 1$ vector and that the last element of the TMVP is not needed, two multiplications can be saved, and we need 52 multiplications in total (note that this total multiplicative complexity is the same regardless of the order of decomposition of 12). We can also extend the $11 \times 11$ circulant matrix to a $15 \times 15$ Toeplitz matrix or a $16 \times 16$ one, which require 66 and 60 multiplications, respectively. Our reformulation is more efficient than these ad hoc techniques in terms of the multiplicative complexity. This is because our reformulation turns a $p$-point cyclic convolution into a $(p-1) \times (p-1)$ TMVP, which directly benefits from multi-dimensional technologies, at the expense of only one extra multiplication.
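The counts quoted above follow from the multiplicative rule of the multi-dimensional construction: the multiplication counts of the small TMVPs multiply. The Python sketch below tallies them. The per-size counts are inferred from the factorizations in the text ($54 = 3 \times 3 \times 6$ and $42 = 3 \times 14$) rather than taken from the appendix, so they should be read as assumptions, as should the names used here.

```python
# Multiplications of the small TMVP building blocks (inferred, see the lead-in).
SMALL_TMVP_MULTS = {2: 3, 3: 6, 5: 14}

def tmvp_mults(factors):
    """Multiplications of a multi-dimensional TMVP whose size is the product of factors."""
    total = 1
    for n in factors:
        total *= SMALL_TMVP_MULTS[n]
    return total

reformulated = tmvp_mults([2, 5]) + 1       # 10 x 10 TMVP plus the product X*_0 Y*_0
partitioned = tmvp_mults([2, 5]) + 21       # 10 x 10 TMVP plus 21 extra products
extended_12 = tmvp_mults([2, 2, 3]) - 2     # 12 x 12 TMVP minus two saved products

summary = {
    "reformulation (10 x 10 TMVP)": reformulated,
    "partitioned 11 x 11 circulant": partitioned,
    "extended to 12 x 12": extended_12,
}
```

The order of the factors does not matter in `tmvp_mults`, which matches the remark that the total is independent of the order in which 12 is decomposed.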

We also compare our result with the implementation via the convolution theorem, i.e., first multiplying the DFTs of the two vectors component-wise, and then computing the inverse DFT of the resulting vector. If we use Rader's algorithm to implement the DFT and inverse DFT, it needs 101 multiplications in total. Hence this approach is less efficient than ours.

By using the CSE algorithm in [12], our 11-point cyclic convolution algorithm requires 43 multiplications and 164 additions. When we use this algorithm in CFFTs over GF($2^{11}$), one of the two inputs is known in advance. In this case, our algorithm requires 42 multiplications since one of the multiplications has an operand of one, and 120 additions because the additions involving the known input can be pre-computed.
IV. Long Cyclotomic Fourier Transforms

A. 2047-point CFFT over GF($2^{11}$)

The efficient algorithm for 11-point cyclic convolution we designed in Sec. III-A is the key to the CFFTs over GF($2^{11}$). Direct implementation of the 2047-point CFFT with this cyclic convolution algorithm requires 7812 multiplications and 2130248 additions. The prohibitively high additive complexity is dominated by the multiplication between the $2047 \times 2047$ matrix $\mathbf{A}$ and a 2047-dimensional vector, which requires 2095280 additions. Unfortunately, if we use the CSE algorithm in [12] to reduce its additive complexity, the time complexity of the CSE algorithm itself is too high (it needs months to finish).

Due to the high time complexity of the CSE algorithm in [12], we have tried a simplified CSE algorithm with limited success. In the original CSE algorithm in [12], only one of the patterns with the greatest recursive savings is selected and removed in one round of iteration. Instead of selecting only one pattern, our simplified CSE algorithm removes multiple patterns at one time and hence has a reduced time complexity. The reduced time complexity of the simplified CSE algorithm allows us to reduce the additive complexity of the 2047-point CFFT to 529720 additions, about one fourth of that of the direct implementation. Despite this improvement, the effectiveness of this simplified CSE algorithm is rather limited.

B. Difficulty with Long CFFTs 

Consider an $N$-point CFFT over GF($2^l$). Suppose $C_{s_0}, C_{s_1}, \cdots, C_{s_{m-1}}$ are the $m$ cyclotomic cosets modulo $N$ over GF(2), and $|C_{s_k}| = m_k$. Suppose an $m_k$-point cyclic convolution can be done with $\mathcal{M}(m_k)$ multiplications, and hence implementing the $N$-point DFT with the CFFT directly requires $\sum_{k=0}^{m-1} \mathcal{M}(m_k)$ multiplications and $C(\mathbf{A}\mathbf{Q}) + C(\mathbf{P})$ additions, where $C(\cdot)$ denotes the number of additions we need to evaluate the product of a binary matrix and a vector. The multiplicative complexity can be further reduced because we can pre-compute the vector $\mathbf{c}$ in (2) and some of its elements may be unitary. Then the CSE algorithm can be applied to the matrices $\mathbf{A}\mathbf{Q}$ and $\mathbf{P}$ to reduce $C(\mathbf{A}\mathbf{Q})$ and $C(\mathbf{P})$ to $C_{\mathrm{CSE}}(\mathbf{A}\mathbf{Q})$ and $C_{\mathrm{CSE}}(\mathbf{P})$, respectively. Since $\mathbf{P} = \mathrm{diag}(\mathbf{P}_0, \mathbf{P}_1, \cdots, \mathbf{P}_{m-1})$ is a block diagonal matrix, we have $C_{\mathrm{CSE}}(\mathbf{P}) = \sum_{i=0}^{m-1} C_{\mathrm{CSE}}(\mathbf{P}_i)$. Therefore, we can reduce the additive complexity of each $\mathbf{P}_i$ to get a better result for $C(\mathbf{P})$. Since the size of $\mathbf{P}_i$ is much smaller than that of $\mathbf{P}$, it allows us to run the CSE algorithm many times to achieve a smaller additive complexity. However, the matrix $\mathbf{A}\mathbf{Q}$ is not a block diagonal matrix, and therefore we have to apply the CSE algorithm directly to $\mathbf{A}\mathbf{Q}$. When the size of $\mathbf{A}\mathbf{Q}$ is large, the CSE algorithm in [12] requires a lot of time and memory and hence it is impractical for extremely long DFTs. As mentioned above, it would take months for the CSE algorithm in [12] to reduce the additive complexity of the 2047-point CFFT over GF($2^{11}$), let alone 4095-point CFFTs over GF($2^{12}$). The prohibitively high time complexity of the CSE algorithm in [12] and the limited effectiveness of the simplified CSE algorithm motivate our composite cyclotomic Fourier transforms.
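The multiplicative count $\sum_k \mathcal{M}(m_k)$ can be reproduced for the 2047-point CFFT with a few lines of Python. The sketch assumes $\mathcal{M}(11) = 42$ (the 11-point algorithm with one input known in advance, as stated in Sec. III-A) and $\mathcal{M}(1) = 0$ for the singleton coset $\{0\}$; the latter value and the function names are assumptions of this example.

```python
from collections import Counter

def cyclotomic_coset_sizes(N):
    """Sizes m_k of the cyclotomic cosets modulo an odd N with respect to GF(2)."""
    seen, sizes = set(), []
    for s in range(N):
        if s in seen:
            continue
        size, t = 0, s
        while t not in seen:       # walk the orbit s, 2s, 4s, ... (mod N)
            seen.add(t)
            size += 1
            t = (2 * t) % N
        sizes.append(size)
    return sizes

# Multiplications per m_k-point cyclic convolution (assumed values, see the lead-in).
CONV_MULTS = {1: 0, 11: 42}

sizes = cyclotomic_coset_sizes(2047)
histogram = Counter(sizes)                       # number of cosets of each size
total_mults = sum(CONV_MULTS[m] for m in sizes)  # sum over k of M(m_k)
```

With 186 cosets of size 11, the total agrees with the 7812 multiplications reported for the direct implementation of the 2047-point CFFT.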

V. Composite Cyclotomic Fourier Transforms 
A. Composite Cyclotomic Fourier Transforms 

Instead of simplifying the CSE algorithm or designing other low-complexity optimization algorithms,
we propose composite cyclotomic Fourier transforms (CCFTs): we first decompose a long DFT into shorter
sub-DFTs via the prime-factor or Cooley-Tukey algorithm, and then implement the sub-DFTs by CFFTs.
Both decompositions require only that $\alpha$ is a primitive $N$-th root of unity, hence they
extend to finite fields easily. When $N$ is prime, our CCFTs reduce to CFFTs. When $N$ is composite,
we first decompose the DFT into shorter sub-DFTs, and then combine the sub-DFT results according to
the prime-factor or Cooley-Tukey decomposition. The shorter sub-DFTs are implemented by CFFTs to reduce
their multiplicative complexities, and then we use the CSE algorithm in [12] to reduce their additive
complexities. Finally, when $N$ has multiple factors, the factorization can be carried out recursively.

Suppose the length of the DFT is composite, i.e., $N = N_1 N_2$. Either the prime-factor or the
Cooley-Tukey algorithm can be used to decompose the $N$-point DFT into sub-DFTs when $N_1$ and $N_2$ are
co-prime. When $N_1$ and $N_2$ are not co-prime, only the Cooley-Tukey algorithm can be used. It is easy
to show that if $N_1$ and $N_2$ are co-prime, the prime-factor and Cooley-Tukey algorithms lead to the
same additive complexity for CCFTs, but the Cooley-Tukey algorithm results in a higher multiplicative
complexity due to the extra multiplications by twiddle factors. Hence the prime-factor algorithm is better
than the Cooley-Tukey algorithm in this case, and the Cooley-Tukey algorithm is used only if the
prime-factor algorithm cannot be applied.
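
As an illustration of the co-prime case, the following sketch checks a $3 \times 5$ prime-factor decomposition of a 15-point DFT over GF($2^4$) against the direct DFT. Several choices here are assumptions made for the example: the field is built from the primitive polynomial $x^4 + x + 1$, $\alpha = x$ is taken as the 15th root of unity, the index maps are the standard Good-Thomas ones, and the sub-DFTs are evaluated directly rather than by CFFTs.

```python
def build_gf16():
    """Antilog/log tables for GF(2^4) modulo x^4 + x + 1 (assumed polynomial)."""
    exp, x = [], 1
    for _ in range(15):
        exp.append(x)
        x <<= 1
        if x & 0x10:
            x ^= 0x13  # reduce modulo x^4 + x + 1
    log = {v: i for i, v in enumerate(exp)}
    return exp, log


EXP, LOG = build_gf16()


def gf_mul(a, b):
    """Multiply two GF(16) elements stored as 4-bit integers."""
    if a == 0 or b == 0:
        return 0
    return EXP[(LOG[a] + LOG[b]) % 15]


def dft(f, root_exp):
    """Direct DFT F[j] = sum_i f[i] * w^(i*j) with w = alpha^root_exp.

    The caller must pick root_exp so that w has order len(f).
    """
    n = len(f)
    out = []
    for j in range(n):
        acc = 0
        for i in range(n):
            acc ^= gf_mul(f[i], EXP[(root_exp * i * j) % 15])  # addition is XOR
        out.append(acc)
    return out


def pfa_dft(f, n1, n2):
    """15-point DFT through an n1 x n2 prime-factor split (gcd(n1, n2) = 1)."""
    n = n1 * n2
    # Input map: i = (n2*i1 + n1*i2) mod n.
    rows = [[f[(n2 * i1 + n1 * i2) % n] for i2 in range(n2)] for i1 in range(n1)]
    # n1 sub-DFTs of length n2, using alpha^n1 (order n2).
    rows = [dft(r, n1) for r in rows]
    # n2 sub-DFTs of length n1, using alpha^n2 (order n1); no twiddle factors.
    cols = [dft([rows[i1][j2] for i1 in range(n1)], n2) for j2 in range(n2)]
    # Output map (CRT): j corresponds to (j mod n1, j mod n2).
    return [cols[j % n2][j % n1] for j in range(n)]


f = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9]
print(pfa_dft(f, 3, 5) == dft(f, 1))
```

No multiplications by twiddle factors appear between the two stages, which is the reason the prime-factor algorithm is preferred whenever the factors are co-prime.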

We denote the multiplicative and additive complexities of an $N$-point DFT by $\mathcal{K}^{\mathrm{mult}}(N)$ and
$\mathcal{K}^{\mathrm{add}}(N)$, respectively, and the algorithm used to implement this DFT is specified in the subscript
of $\mathcal{K}$. Suppose $N = \prod_{i=1}^{s} N_i$, and the total number of non-unity twiddle factors required by the
Cooley-Tukey decompositions is denoted by $T$. Then the complexity of this decomposition is given by

$$\mathcal{K}^{\mathrm{add}}_{\mathrm{CCFT}}(N) = \sum_{i=1}^{s} \frac{N}{N_i} \mathcal{K}^{\mathrm{add}}_{\mathrm{CFFT}}(N_i), \qquad (14)$$

$$\mathcal{K}^{\mathrm{mult}}_{\mathrm{CCFT}}(N) = \sum_{i=1}^{s} \frac{N}{N_i} \mathcal{K}^{\mathrm{mult}}_{\mathrm{CFFT}}(N_i) + T. \qquad (15)$$

For $N \mid 2^l - 1$ with $4 \le l \le 12$, there is at most one pair of $N_i$'s that are not co-prime in the
decomposition of $N$, say $N_1$ and $N_2$, without loss of generality. In this case,
$T = \frac{N}{N_1 N_2}(N_1 - 1)(N_2 - 1)$. If all the factors in the decomposition of $N$ are co-prime to each
other, then $T = 0$.
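
The expressions (14) and (15) are easy to evaluate mechanically. The sketch below does so for a few decompositions; the function names are illustrative, and the sub-DFT costs are the smallest CFFT entries for lengths 3, 5, 7, and 15 in Table II. The sub-DFT costs are weighted by $N/N_i$ and the twiddle-factor count $T$ is added to the multiplications only.

```python
from math import prod

# Smallest CFFT complexities for short lengths, copied from Table II.
CFFT_MULT = {3: 1, 5: 5, 7: 6, 15: 16}
CFFT_ADD = {3: 6, 5: 16, 7: 24, 15: 74}


def twiddle_count(n, n1, n2):
    """Non-unity twiddle factors when n1 and n2 are the one non-co-prime pair."""
    return n // (n1 * n2) * (n1 - 1) * (n2 - 1)


def ccft_complexity(factors, twiddles=0):
    """Return (multiplications, additions) of a CCFT with the given factors."""
    n = prod(factors)
    mult = sum(n // ni * CFFT_MULT[ni] for ni in factors) + twiddles
    add = sum(n // ni * CFFT_ADD[ni] for ni in factors)
    return mult, add


print(ccft_complexity([3, 5]))                           # co-prime, T = 0
print(ccft_complexity([3, 3], twiddle_count(9, 3, 3)))   # Cooley-Tukey, T = 4
print(ccft_complexity([3, 3, 7], twiddle_count(63, 3, 3)))
```

The three calls reproduce the CCFT entries for $N = 15$ in Table IV and for $N = 9$ and $N = 63$ in Table III.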

The decomposition allows our CCFTs to achieve low complexities for several reasons. First, this
divide-and-conquer strategy is used in many fast Fourier transforms. If we assume that CFFTs have additive
complexities quadratic in their length $N$ when directly implemented (this assumption is at least supported by
the additive complexities of the CFFTs without CSE in Table IV), the CCFT decomposition reduces
the additive complexity from $O(N^2)$ to $O(N \sum_{i=1}^{s} N_i)$. Second, the lengths of the sub-DFTs are much
shorter, which enables us to apply several powerful but complicated techniques to reduce the complexities
of the sub-DFTs. For example, it takes much less time and memory to apply the CSE algorithm in [12]
to the sub-DFTs, and thus we can run it multiple times to get a better reduction result. Third, when the
length of the DFT admits different factorizations (for example, $2^6 - 1 = 63 = 3 \times 21 = 9 \times 7$), we
choose the decomposition(s) with the lowest complexity.
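
To give a feeling for the orders of growth in the first argument, consider the longest transform treated here, $N = 4095$ with the two-factor split $63 \times 65$. The figures below only compare the growth terms under the quadratic assumption stated above; they are not addition counts.

```latex
\[
N = 4095 = 63 \times 65: \qquad
N^2 = 16\,769\,025, \qquad
N (N_1 + N_2) = 4095 \times 128 = 524\,160,
\]
\[
\frac{N^2}{N (N_1 + N_2)} = \frac{4095}{128} \approx 32 .
\]
```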

B. Complexity Reduction 

We reduce the additive complexities of our CCFTs in three steps. First, we reduce the complexities of 
short cyclic convolutions. Second, we use these short cyclic convolutions to construct CFFTs of moderate 
lengths. Third, we use CFFTs of moderate lengths as sub-DFTs to construct our CCFTs. 

Efficient short cyclic convolution algorithms are the keys to the multiplicative complexity reduction of
CFFTs and our CCFTs, and hence our first step is to reduce the computational complexities of small-size
cyclic convolutions. Suppose an $L$-point cyclic convolution $b^{(L)} \circledast a^{(L)}$ is calculated with the bilinear form
$Q^{(L)}(R^{(L)} b^{(L)} \cdot P^{(L)} a^{(L)})$. Since $b^{(L)}$ is the normal basis in our CCFTs, $R^{(L)} b^{(L)}$ can be precomputed
to reduce the multiplicative complexity. We apply the CSE algorithm in [12] to reduce the additive complexities
of the multiplications with the binary matrices $Q^{(L)}$ and $P^{(L)}$. The complexity reduction results $C_{CSE}(Q^{(L)})$
and $C_{CSE}(P^{(L)})$, the total additive complexity $C_{CSE}(Q^{(L)}) + C_{CSE}(P^{(L)})$, and the multiplicative complexities
are listed in Table I.
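
The effect of a normal-basis operand can be seen in the smallest case. The sketch below is an assumed illustration, not the algorithm behind the table: it computes a 2-point cyclic convolution over GF($2^2$) where $b = (\beta, \beta^2)$ is a normal basis, so that $b_0 + b_1 = 1$. Under that assumption one multiplication and three additions suffice, which matches the $L = 2$ counts quoted for short cyclic convolutions. The field polynomial $x^2 + x + 1$ and all function names are choices made for this example.

```python
def gf4_mul(a, b):
    """Multiply in GF(2^2) = GF(2)[x]/(x^2 + x + 1); elements are 2-bit ints."""
    r = 0
    for _ in range(2):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x4:
            a ^= 0x7  # reduce modulo x^2 + x + 1
    return r


def cyclic_conv2_direct(a, b):
    """Schoolbook 2-point cyclic convolution (four multiplications)."""
    c0 = gf4_mul(a[0], b[0]) ^ gf4_mul(a[1], b[1])
    c1 = gf4_mul(a[0], b[1]) ^ gf4_mul(a[1], b[0])
    return [c0, c1]


def cyclic_conv2_normal_basis(a, b):
    """Same convolution, valid only when b[0] + b[1] = 1 (normal basis)."""
    s = a[0] ^ a[1]                 # pre-addition
    c0 = gf4_mul(s, b[0]) ^ a[1]    # the single multiplication
    c1 = c0 ^ s                     # since c0 + c1 = (a0 + a1)(b0 + b1) = s
    return [c0, c1]


beta = 2                            # the element x
b = [beta, gf4_mul(beta, beta)]     # (beta, beta^2), with beta + beta^2 = 1
print(all(cyclic_conv2_direct([a0, a1], b) == cyclic_conv2_normal_basis([a0, a1], b)
          for a0 in range(4) for a1 in range(4)))
```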

The second step is to reduce the additive complexity of CFFTs with moderate lengths, which will be 
used to build our CCFTs. Their moderate lengths allow us to use multiple techniques to reduce their 
additive complexities. 

• First, for any CFFT, we run the CSE algorithm in [12] multiple times and then choose the best
result.
TABLE I
Complexities of short cyclic convolutions over GF($2^l$).

| $L$ | mult. | $C_{CSE}(Q^{(L)})$ | $C_{CSE}(P^{(L)})$ | total additive |
|---|---|---|---|---|
| 2 | 1 | 2 | 1 | 3 |
| 3 | 3 | 5 | 4 | 9 |
| 4 | 5 | 9 | 4 | 13 |
| 5 | 9 | 16 | 10 | 26 |
| 6 | 10 | 21 | 11 | 32 |
| 7 | 12 | 24 | 23 | 47 |
| 8 | 19 | 35 | 16 | 51 |
| 9 | 18 | 40 | 31 | 71 |
| 10 | 28 | 52 | 31 | 83 |
| 11 | 42 | 76 | 44 | 120 |
| 12 | 32 | 53 | 34 | 87 |


• Second, for each CFFT, we may reduce $C(AQ)$ as a whole, or reduce $C(A)$ and
$C(Q)$ separately. Since $(AQ)v = A(Qv)$, $C_{opt}(AQ) \le C_{opt}(A) + C_{opt}(Q)$. However, this property
may not hold for the CSE algorithm because the CSE algorithm may not find the optimal solutions.
Furthermore, we may benefit from reducing $C(A)$ and $C(Q)$ separately for the following reasons.
First, $Q$ has a block diagonal structure similar to that of $P$, and therefore we can find a better
reduction result for $C(Q)$. Second, $AQ$ has many more columns than $A$, and hence the CSE algorithm requires
less memory and time to reduce $A$ than to reduce $AQ$.

• Third, there is flexibility in the normal bases used to construct the matrix $A$ of the CFFT, and this
flexibility can be used to further reduce the additive complexity of any CFFT. For each cyclotomic
coset, a normal basis is needed. A normal basis is not unique in finite fields, and any normal basis
can be used in the construction of the matrix $A$, leading to the same multiplicative complexity.
But different normal bases result in different $A$ and hence different additive complexities due to
$A$. There are several options regarding the normal basis. One can simply choose a fixed normal
basis for all cyclotomic cosets of the same size, as in [12]. A more ideal option is to enumerate
all possible normal bases and their corresponding $A$ and to select the smallest additive complexity.
However, when the underlying field is large, the number of possible normal bases is very large, and
hence it becomes infeasible to enumerate all possible constructions. Thus, in this paper we use a
compromise between these two options: for each cyclotomic coset we choose a normal basis at random,
and the combination of random normal bases leads to $A$; we minimize the complexity over as many
combinations as the computational cost permits. We refer to this as the random normal basis option.
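
The non-uniqueness of normal bases can be made concrete in a small field. The sketch below enumerates the normal elements of GF($2^4$) over GF(2), i.e., the elements $\beta$ whose conjugates $\beta, \beta^2, \beta^4, \beta^8$ are linearly independent over GF(2). The field polynomial $x^4 + x + 1$ and the helper names are assumptions for this example; the fields used for longer CFFTs are larger, and the count grows accordingly.

```python
def gf16_mul(a, b):
    """Multiply in GF(2^4) = GF(2)[x]/(x^4 + x + 1); elements are 4-bit ints."""
    r = 0
    for _ in range(4):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x10:
            a ^= 0x13  # reduce modulo x^4 + x + 1
    return r


def conjugates(beta, m=4):
    """Return [beta, beta^2, beta^4, ...] with m entries."""
    out = [beta]
    for _ in range(m - 1):
        out.append(gf16_mul(out[-1], out[-1]))
    return out


def rank_gf2(vectors):
    """Rank over GF(2) of bit-vectors stored as integers (XOR basis)."""
    basis = []
    for v in vectors:
        for b in basis:
            v = min(v, v ^ b)  # clears the leading bit of b from v if present
        if v:
            basis.append(v)
    return len(basis)


# beta is normal exactly when its four conjugates span GF(2^4).
normal_elements = [beta for beta in range(1, 16) if rank_gf2(conjugates(beta)) == 4]
normal_bases = {frozenset(conjugates(beta)) for beta in normal_elements}
print(len(normal_elements), len(normal_bases))
```

A random normal basis option would draw one basis per cyclotomic coset from such a list and keep the combination giving the cheapest $A$.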

We emphasize that all three techniques require multiple runs of the CSE algorithm. Since the time and
memory requirements of the CSE algorithm grow with the length of the DFT, the moderate lengths of the
sub-DFTs are the key enabler of these techniques.
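
For readers unfamiliar with common subexpression elimination on binary matrices, the sketch below shows one generic greedy heuristic: repeatedly find the pair of inputs that appears together in the most rows, compute their sum once, and reuse it. This is explicitly not the CSE algorithm of [12], whose details are not reproduced here; it is only meant to show what kind of saving is being sought. The example matrix is made up.

```python
from collections import Counter
from itertools import combinations


def naive_additions(matrix):
    """Additions for a row-by-row evaluation with no sharing."""
    return sum(max(sum(row) - 1, 0) for row in matrix)


def greedy_cse_additions(matrix):
    """Additions after greedily sharing the most frequent pair of inputs."""
    rows = [{j for j, bit in enumerate(r) if bit} for r in matrix]
    next_col = len(matrix[0])   # index given to the next shared sum
    shared = 0                  # one addition per shared sum
    while True:
        counts = Counter()
        for r in rows:
            counts.update(combinations(sorted(r), 2))
        if not counts:
            break
        (a, b), freq = counts.most_common(1)[0]
        if freq < 2:
            break               # no pair is reused; nothing left to share
        for r in rows:
            if a in r and b in r:
                r -= {a, b}
                r.add(next_col)
        next_col += 1
        shared += 1
    return shared + sum(max(len(r) - 1, 0) for r in rows)


# Hypothetical 3 x 4 binary matrix (an assumption, not from the paper).
M = [[1, 1, 1, 0],
     [1, 1, 0, 1],
     [1, 1, 1, 1]]
print(naive_additions(M), greedy_cse_additions(M))
```

Heuristics of this kind are not guaranteed to be optimal, which is why running them several times on moderate-size matrices and keeping the best outcome can pay off.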

For any $k < 320$ such that $k \mid 2^l - 1$ ($4 \le l \le 12$), the multiplicative and additive complexities of
the $k$-point CFFT are shown in Table II. Table II shows four different schemes to reduce the additive
complexity of CFFTs. Schemes A and B both use the fixed normal basis option in the construction of
the matrix $A$, while schemes C and D are based on the random normal basis option. Schemes A and
C reduce $C(A)$ and $C(Q)$ separately, while schemes B and D reduce $C(AQ)$ as a whole. For smaller
CFFTs, we typically minimize the complexity over hundreds of combinations of normal bases, and over fewer
combinations for longer CFFTs. In Table II, the smallest additive complexities are in boldface font. We
observe that the random normal basis option offers further additive complexity reduction in most
cases. However, since the fixed normal basis is not necessarily one of the combinations, in some cases
the fixed normal basis option outperforms the random normal basis option. Also, sometimes applying the
CSE algorithm to $AQ$ as a whole leads to lower complexity, and in other cases it is better to apply the
CSE algorithm to $A$ and $Q$ separately.

In the third step, we use the CFFTs with moderate lengths in Table II as sub-DFTs to construct our
CCFTs. With (14) and (15), the computational complexities of our CCFTs over GF($2^l$) ($4 \le l \le 12$) with
non-prime lengths can be calculated. The results are summarized in Table III, where the factorizations
in parentheses are not co-prime and the Cooley-Tukey algorithm is used in these cases. We have tried
all the decompositions with sub-DFT lengths smaller than 320, and the decompositions with the smallest overall
complexities are listed in Table III. Note that for each sub-DFT, the scheme with the smallest additive
complexity listed in Table II is used in the CCFT implementation to reduce the total additive complexity.
We also note that all DFT lengths in Table III are composite. The prime lengths are omitted because
when $N$ is prime, an $N$-point CCFT reduces to an $N$-point CFFT, which can be found in Table II.

Since some lengths of the DFTs have more than one decomposition, it is possible that one
decomposition scheme has a smaller additive complexity but a larger multiplicative complexity than another one.
Therefore, we need a metric to compare the overall complexities of different decompositions. In
this paper, we follow our previous work [12] and assume that the complexity of a multiplication over
GF($2^l$) is $2l - 1$ times that of an addition over the same field, and the total complexity of a DFT
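TABLE II
The complexities of the CFFTs whose lengths are less than 320 and are factors of $2^l - 1$ for $1 \le l \le 12$.
Columns A, B, C, and D give the additive complexities of the four schemes.

| $N$ | $l$ | mult. | A | B | C | D |
|---|---|---|---|---|---|---|
| 3 | 2 | 1 | 6 | 6 | 6 | 6 |
| 5 | 4 | 5 | 20 | 16 | 20 | 16 |
| 7 | 3 | 6 | 31 | 24 | 31 | 24 |
| 9 | 6 | 11 | 51 | 48 | 51 | 48 |
| 11 | 10 | 28 | 109 | 102 | 102 | 84 |
| 13 | 12 | 32 | 125 | 100 | 110 | 91 |
| 15 | 4 | 16 | 87 | 74 | 87 | 74 |
| 17 | 8 | 38 | 153 | 163 | 151 | 153 |
| 21 | 6 | 27 | 167 | 179 | 147 | 153 |
| 23 | 11 | 84 | 335 | 407 | 323 | 357 |
| 31 | 5 | 54 | 354 | 299 | 335 | 350 |
| 33 | 10 | 85 | 413 | 440 | 404 | 434 |
| 35 | 12 | 75 | 406 | 303 | 358 | 299 |
| 39 | 12 | 97 | 502 | 425 | 472 | 391 |
| 45 | 12 | 90 | 481 | 415 | 498 | 414 |
| 51 | 8 | 115 | 641 | 755 | 676 | 739 |
| 63 | 6 | 97 | 798 | 759 | 806 | 1031 |
| 65 | 12 | 165 | 1092 | 901 | 1114 | 915 |
| 73 | 9 | 144 | 1498 | 1567 | 1447 | 1526 |
| 85 | 8 | 195 | 1601 | 1816 | 1589 | 1810 |
| 89 | 11 | 336 | 2085 | 4326 | 2247 | 3973 |
| 91 | 12 | 230 | 1668 | 1431 | 1596 | 1421 |
| 93 | 10 | 223 | 1772 | 1939 | 1736 | 1788 |
| 105 | 12 | 234 | 1762 | 1481 | 1776 | 1333 |
| 117 | 12 | 299 | 2304 | 2028 | 2366 | 1947 |
| 195 | 12 | 496 | 4900 | 4230 | 4942 | 4166 |
| 273 | 12 | 699 | 8064 | 7217 | 8082 | 7223 |
| 315 | 12 | 752 | 8965 | 8032 | 9899 | 8099 |
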
is a weighted sum of the additive and multiplicative complexities, i.e., total $= (2l - 1) \times$ mult $+$ add.
This assumption is based on both software and hardware implementation considerations [12]. Table
III lists the decompositions with the smallest overall complexities.
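
The weighted metric is straightforward to apply when two decompositions trade multiplications for additions. The sketch below compares the two options for $N = 15$ over GF($2^4$), using the CFFT and $3 \times 5$ CCFT complexities reported for this length; the dictionary layout and names are illustrative only.

```python
def total_complexity(l, mult, add):
    """Total cost in additions, weighting a GF(2^l) multiplication by 2l - 1."""
    return (2 * l - 1) * mult + add


# (multiplications, additions) for the two ways of computing a 15-point DFT.
candidates_15 = {
    "1 x 15": (16, 74),   # plain CFFT
    "3 x 5": (20, 78),    # prime-factor CCFT
}

best = min(candidates_15, key=lambda d: total_complexity(4, *candidates_15[d]))
for name, (mult, add) in candidates_15.items():
    print(name, total_complexity(4, mult, add))
print("best:", best)
```

For this short length the plain CFFT wins, in line with the observation that CCFTs pay off for moderate to long lengths.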

Tables II and III provide the complexities of all $N$-point DFTs over GF($2^l$) for $N \mid 2^l - 1$ and $4 \le l \le 12$.
TABLE III
The smallest complexity of our $N$-point CCFTs over GF($2^l$) for composite $N$ and $N \mid 2^l - 1$ for $4 \le l \le 12$
(we assume the sub-DFTs are shorter than 320).

| $l$ | Length | Decomposition | mult. | add. | total |
|---|---|---|---|---|---|
| 4 | 15 | 1 × 15 | 16 | 74 | 186 |
| 6 | 9 | (3 × 3) | 10 | 36 | 146 |
| 6 | 21 | 3 × 7 | 25 | 114 | 389 |
| 6 | 63 | (3 × 3) × 7 | 124 | 468 | 1832 |
| 8 | 51 | 1 × 51 | 115 | 641 | 2366 |
| 8 | 85 | 1 × 85 | 195 | 1590 | 4515 |
| 8 | 255 | 3 × 85 | 670 | 5277 | 15327 |
| 9 | 511 | 7 × 73 | 1446 | 11881 | 36463 |
| 10 | 33 | 1 × 33 | 85 | 404 | 2019 |
| 10 | 93 | 3 × 31 | 193 | 1083 | 4750 |
| 10 | 341 | 1 × 341 | 922 | 15184 | 32702 |
| 10 | 1023 | 33 × 31 | 4417 | 22391 | 106314 |
| 11 | 2047 | 23 × 89 | 15204 | 76702 | 395986 |
| 12 | 35 | 5 × 7 | 65 | 232 | 1727 |
| 12 | 39 | 1 × 39 | 97 | 391 | 2622 |
| 12 | 45 | (3 × 15) | 91 | 312 | 2405 |
| 12 | 65 | 1 × 65 | 165 | 902 | 4697 |
| 12 | 91 | 1 × 91 | 230 | 1421 | 6711 |
| 12 | 105 | 7 × 15 | 202 | 878 | 5524 |
| 12 | 117 | 1 × 117 | 299 | 1947 | 8824 |
| 12 | 195 | 3 × 65 | 560 | 3093 | 15973 |
| 12 | 273 | 3 × 91 | 781 | 4809 | 22772 |
| 12 | 315 | 5 × 63 | 800 | 4803 | 23203 |
| 12 | 455 | 7 × 65 | 1545 | 7867 | 43402 |
| 12 | 585 | 5 × 117 | 2080 | 11607 | 59447 |
| 12 | 819 | 7 × 117 | 2795 | 16437 | 80722 |
| 12 | 1365 | 7 × 195 | 4642 | 33842 | 140608 |
| 12 | 4095 | 65 × 63 | 16700 | 106098 | 490198 |


Note that the decomposition corresponding to $1 \times N$ is merely the $N$-point CFFT over GF($2^l$). We have
used the simplified CSE algorithm described in Sec. IV-A to reduce the complexity of the 2047-point
CFFTs over GF($2^{11}$), and applied the CSE algorithm in [12] to the other CFFTs. Thus, we have expanded
the results of [12], where only the $(2^l - 1)$-point CFFTs over GF($2^l$) were given. We also observe that
for some short lengths (see, for example, $N = 15$, 33, or 65), the $N$-point CFFTs lead to the lowest
complexity for the $N$-point CCFTs. For the DFTs with lengths larger than 320, i.e., 511-point CFFTs
over GF($2^9$), 341-point CFFTs over GF($2^{10}$), and 455-, 585-, 819-, and 1365-point CFFTs over GF($2^{12}$),
the time complexity of the CSE algorithm in [12] is still considerable. Thus, we cannot minimize their
complexities using schemes A, B, C, and D, and hence they are not listed in Table II.

Although the twiddle factors in the Cooley-Tukey decomposition incur extra multiplicative
complexity, Tables II and III show that the Cooley-Tukey decomposition reduces the total complexity
of our CCFTs in some cases (the decompositions in parentheses). For example, while the 9-point CFFT
requires 11 multiplications and 48 additions, the $3 \times 3$ CCFT based on the Cooley-Tukey
decomposition requires 10 multiplications and 36 additions. Despite the twiddle factors, the CCFT based on
the Cooley-Tukey decomposition has lower multiplicative and additive complexities, because
the decomposition allows us to take advantage of the low complexity of the
3-point DFT.
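
The $3 \times 3$ case can be checked numerically. The sketch below computes a 9-point DFT over GF($2^6$) through a Cooley-Tukey split into 3-point DFTs and counts the non-unity twiddle multiplications. It makes several assumptions for the sake of a runnable example: the field is built from the primitive polynomial $x^6 + x + 1$, the 9th root of unity is $\alpha^7$, and the 3-point sub-DFTs are evaluated directly rather than by CFFTs, so only the twiddle count (not the overall multiplication count) is comparable with the figures above.

```python
from functools import reduce
from operator import xor


def build_gf64():
    """Antilog/log tables for GF(2^6) modulo x^6 + x + 1 (assumed polynomial)."""
    exp, x = [], 1
    for _ in range(63):
        exp.append(x)
        x <<= 1
        if x & 0x40:
            x ^= 0x43  # reduce modulo x^6 + x + 1
    return exp, {v: i for i, v in enumerate(exp)}


EXP, LOG = build_gf64()


def gf_mul(a, b):
    if a == 0 or b == 0:
        return 0
    return EXP[(LOG[a] + LOG[b]) % 63]


def root_pow(order, e):
    """e-th power of the primitive `order`-th root of unity alpha^(63/order)."""
    return EXP[(63 // order) * e % 63]


def dft(f):
    """Direct DFT of length len(f), which must divide 63."""
    n = len(f)
    return [reduce(xor, (gf_mul(f[i], root_pow(n, i * j)) for i in range(n)), 0)
            for j in range(n)]


def cooley_tukey_3x3(f):
    """9-point DFT via i = 3*i1 + i2, j = j1 + 3*j2; returns (F, twiddle mults)."""
    n1 = n2 = 3
    inner = [dft([f[n2 * i1 + i2] for i1 in range(n1)]) for i2 in range(n2)]
    twiddle_mults = 0
    for i2 in range(n2):
        for j1 in range(n1):
            if (i2 * j1) % 9:  # skip multiplications by one
                inner[i2][j1] = gf_mul(inner[i2][j1], root_pow(9, i2 * j1))
                twiddle_mults += 1
    out = [0] * 9
    for j1 in range(n1):
        col = dft([inner[i2][j1] for i2 in range(n2)])
        for j2 in range(n2):
            out[j1 + n1 * j2] = col[j2]
    return out, twiddle_mults


f = [1, 2, 3, 5, 8, 13, 21, 34, 55]
F, t = cooley_tukey_3x3(f)
print(F == dft(f), t)
```

The four non-unity twiddle factors agree with $T = (N_1 - 1)(N_2 - 1)$ for $N = N_1 N_2 = 9$.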

C. Complexity Comparison and Analysis 

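TABLE IV
Comparison of the complexities of our $N$-point CCFTs with FFTs available in the literature.
Column groups: Wang and Zhu [29]; Trung et al. [9]; CFFT without CSE and with the CSE of [12]; CCFT.

| $N$ | Field | [29] mult. | [29] add. | [29] total | [9] mult. | [9] add. | [9] total | CFFT mult. | CFFT add. (w/o CSE) | CFFT total (w/o CSE) | CFFT add. (w/ CSE [12]) | CFFT total (w/ CSE [12]) | CCFT mult. | CCFT add. | CCFT total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 15 | GF($2^4$) | 41 | 97 | 384 | - | - | - | 16 | 201 | 313 | 74 | 186 | 20 | 78 | 218 |
| 63 | GF($2^6$) | 801 | 801 | 9612 | - | - | - | 97 | 2527 | 3594 | 759 | 1826 | 124 | 468 | 1832 |
| 255 | GF($2^8$) | 1665 | 5377 | 30352 | 1135 | 3887 | 20902 | 586 | 34783 | 43573 | 6736 | 15526 | 670 | 5277 | 15327 |
| 511 | GF($2^9$) | 13313 | 13313 | 239634 | 6516 | 17506 | 128278 | 1014 | 141710 | 158948 | 23130 | 40368 | 1446 | 11881 | 36463 |
| 1023 | GF($2^{10}$) | 32257 | 32257 | 645140 | 5915 | 30547 | 142932 | 2827 | 536093 | 589806 | 75360 | 129073 | 4417 | 22391 | 106314 |
| 2047 | GF($2^{11}$) | 78601 | 78601 | 1689622 | - | - | - | 7812 | 2130248 | 2294300 | - | - | 15204 | 76702 | 395986 |
| 4095 | GF($2^{12}$) | 180225 | 180225 | 4325400 | - | - | - | 10832 | 8434414 | 8683550 | - | - | 16700 | 106098 | 490198 |
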

We compare the complexities of our CCFTs with those of previously proposed FFTs in the literature
in Table IV. For each length, the lowest total complexity is in boldface font. In Table IV, our CCFTs
achieve the lowest complexities for $N \ge 255$. Although the algorithm in [29] is proved asymptotically
fast, the complexities of our CCFTs are only a fraction of those in [29], and the advantage grows
as the length increases. Although the FFTs in [9] are also based on the prime-factor algorithm, our
CCFTs achieve lower complexities for two reasons. First, since our CCFTs use CFFTs as the sub-DFTs,
the multiplicative complexities of our CCFTs are greatly reduced compared with the FFTs in [9]. For
example, the multiplicative complexity of our 511-point CCFT is only about one fourth of that of the
prime-factor algorithm in [9]. Second, using the powerful CSE algorithm in [12], the additive complexities of
our CCFTs are also greatly reduced. Compared with the CFFTs, our CCFTs have somewhat higher
multiplicative complexities, but this is more than made up for by the reduced additive complexities of our
CCFTs. The additive complexities of our CCFTs are only a small fraction of those of CFFTs when
directly implemented. Compared with the CFFTs with reduced additive complexities in [12], our CCFTs
still have much smaller additive complexities for $N \ge 63$ due to their decomposition structure. For
example, the additive complexity of our CCFT is only about half of that of the CFFT for $N = 511$,
and about one third for $N = 1023$. Due to the significant reduction of the additive complexities, the total
complexities of our CCFTs with $N \ge 255$ are lower than those of CFFTs. In comparison to CFFTs, the
improvement by our CCFTs also grows as the length increases.

For the DFTs whose lengths are prime, such as the 31-point DFT over GF($2^5$), the 127-point DFT over
GF($2^7$), and the 8191-point DFT over GF($2^{13}$), our CCFTs reduce to the CFFTs, and they have the same
computational complexities.

VI. Regular and Modular Structure of Our CCFTs 

We have shown that our CCFTs lead to lower complexities for moderate to long lengths. Regardless 
of the length, our CCFTs also have advantages in hardware implementations due to their regular and 
modular structure. 



Fig. 2. The structure of the CFFTs.

The CFFT algorithm has a bilinear form, and therefore its circuitry can be divided into three parts,
as shown in Fig. 2. The input vector $f$ first goes through a pre-addition network, which reorders $f$
into $f'$ and then computes $Pf'$. Then the resulting vector is sent to a multiplicative network, in which
the component-wise product of $c$ and $Pf'$ is computed. The DFT result $F$ is finally computed in the
post-addition network, which corresponds to the linear transform $AQ$. While the structure of the CFFT
looks simple, the two additive networks are very complex for long DFTs. Although we can reduce the
additive complexity by the CSE algorithm, the resulting additive networks still require a large number
of additions. Furthermore, the additions due to $A$ or $AQ$ (the second additive network in Fig. 2) lack
regularity, and hence it is hard to use architectural techniques such as folding and pipelining to achieve
smaller area or higher throughput.

In contrast, our CCFTs have a regular and modular structure since they are decomposed into shorter
sub-DFTs. The sub-DFTs can be implemented much more easily than the long ones, and they can be reused
in the CCFT architecture. Fig. 3 shows the regular and modular structure of a $3 \times 5$ CCFT. Instead of
designing the 15-point CFFT directly, we only need to design a 3-point CFFT module and a 5-point
CFFT module, and compute the 15-point CCFT by reusing these modules according to the structure
shown in Fig. 3. It is much easier to apply architectural techniques such as folding and pipelining to this
regular and modular structure, leading to efficient hardware implementations.

Fig. 3. The regular and modular structure of our 15-point CCFT based on a $3 \times 5$ decomposition.



VII. Conclusion 

For any odd prime integer $p$, we reformulate the $p$-point cyclic convolution as a $(p - 1) \times (p - 1)$ Toeplitz
matrix vector product, leading to efficient cyclic convolution algorithms. Based on this reformulation,
we have obtained an efficient 11-point cyclic convolution algorithm and derived the CFFTs over GF($2^{11}$).
We have shown that our composite cyclotomic Fourier transform algorithm leads to lower complexities
by decomposing long DFTs into shorter ones using the prime-factor or Cooley-Tukey algorithm.
Our CCFTs over GF($2^l$) ($4 \le l \le 12$) have lower complexities than previously known FFTs over finite
fields. They also have a regular and modular structure, which is desirable in hardware implementations.



Appendix A 
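Short Toeplitz Matrix Vector Product over GF($2^l$)

An $n \times n$ TMVP over GF($2^l$),

$$\begin{bmatrix} u_0 \\ u_1 \\ \vdots \\ u_{n-1} \end{bmatrix} =
\begin{bmatrix}
r_{n-1} & r_n & \cdots & r_{2n-2} \\
r_{n-2} & r_{n-1} & \cdots & r_{2n-3} \\
\vdots & \vdots & \ddots & \vdots \\
r_0 & r_1 & \cdots & r_{n-1}
\end{bmatrix}
\begin{bmatrix} v_0 \\ v_1 \\ \vdots \\ v_{n-1} \end{bmatrix},$$

can be computed with the bilinear algorithm $E^{(n)}(G^{(n)} r \cdot H^{(n)} v)$, where $r = (r_0, r_1, \dots, r_{2n-2})^T$,
$v = (v_0, v_1, \dots, v_{n-1})^T$, and $E^{(n)}$, $G^{(n)}$, and $H^{(n)}$ are all binary matrices.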
For $n = 2$ (see, for example, [18], [20]),
101
011

For $n = 3$ (see, for example, [20]),
011
010
010
100
011
For $n = 5$,

$E^{(5)}$ =
00001000010111
00010001001011
00100010101100
01000100110001
10000111000001

$G^{(5)}$ =
111110000
011111000
001111100
000111110
000011111
010010000
001000000
000110000
000100000
000011000
000001000
000000100
000010010
000010000

$H^{(5)}$ =
10000
01000
00100
00010
00001
11000
10100
10010
01100
01001
00110
00101
00011
11011
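
The $n = 5$ matrices can be checked mechanically. The sketch below assigns the three binary matrices of the $n = 5$ case by their dimensions (the $5 \times 14$ matrix as $E^{(5)}$, the $14 \times 9$ matrix as $G^{(5)}$, and the $14 \times 5$ matrix as $H^{(5)}$, which is the only assignment compatible with $E(Gr \cdot Hv)$) and verifies that, modulo 2, the coefficient of $r_a v_b$ in output $u_i$ equals one exactly when $a = n - 1 - i + b$, i.e., that the bilinear form reproduces the Toeplitz product with 14 multiplications. The function names are illustrative.

```python
E5 = ["00001000010111", "00010001001011", "00100010101100",
      "01000100110001", "10000111000001"]
G5 = ["111110000", "011111000", "001111100", "000111110", "000011111",
      "010010000", "001000000", "000110000", "000100000", "000011000",
      "000001000", "000000100", "000010010", "000010000"]
H5 = ["10000", "01000", "00100", "00010", "00001", "11000", "10100",
      "10010", "01100", "01001", "00110", "00101", "00011", "11011"]


def bilinear_coefficient(i, a, b):
    """Coefficient (mod 2) of r_a * v_b in output u_i of E(G r . H v)."""
    return sum(int(E5[i][k]) & int(G5[k][a]) & int(H5[k][b])
               for k in range(len(G5))) % 2


def is_valid_tmvp(n=5):
    """True if E(G r . H v) equals the n x n Toeplitz product over GF(2^l)."""
    for i in range(n):
        for a in range(2 * n - 1):
            for b in range(n):
                # Row i of the Toeplitz matrix holds r_{n-1-i}, ..., r_{2n-2-i}.
                expected = 1 if a == n - 1 - i + b else 0
                if bilinear_coefficient(i, a, b) != expected:
                    return False
    return True


print(is_valid_tmvp())
```

Since the identity holds coefficient by coefficient over GF(2), it holds for operands in any extension field GF($2^l$).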

Appendix B 
4-, 8-, and 11-Point Cyclic Convolution Algorithms over GF($2^l$)



11110 
110 11110 
1111110 
110 111 



For 4-point cyclic convolutions [24],



Q 



(4) 
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r< 4 ) 



p(4) 



10 10 
10 
110 

1111 

10 10 

1111 

11 

1111 
1111 



110 

1111 

10 10 
10 

1111 

10 1 
10 1 
10 

1111 

For 8-point cyclic convolution [25],

$Q^{(8)}$ =
110000110110000000000000110
101000101101000000000000111
110000110000110000000110000
101000101000101000000111000
110110110110000000110000000
101101101101000000111000000
110110110000110110000000000
101101101000101111000000000

$(R^{(8)})^T$ =
111111111111111111111111111
001001001001011011011011011
000000000000111111111111111
000010010000001011011011011
000111111000000111111111111
000001001010000011011011011
000000000111000111111111111
010010010011010011011011011

and

$(P^{(8)})^T$ =
101101000000101101000000000
110110000000110111000000000
101101000101000000101000000
110110000110000000111000000
101101101000101000000101000
110110110000110000000111000
101101101101000000000000101
110110110110000000000000111

For our new 11-point cyclic convolutions, $Q^{(11)}$ is given by

1000000000000001111100000000011111000000000
1000010000101110000100001011100000000000000
1000100010010110001000100101100000000000000
1001000101011000010001010110000000000000000
1010001001100010100010011000100000000000000
1100001110000011000011100000100000000000000
1000010000101110000000000000000001000010111
1000100010010110000000000000000010001001011
1001000101011000000000000000000100010101100
1010001001100010000000000000001000100110001
1100001110000010000000000000010000111000001

the transpose of $R^{(11)}$ is given by

1100000101101011111100001100001111000011000
1000001111111101111110101001111111101010011
1000010110011011111000110000011111001100000
1000110001011011110101000000011111010000010
1001111101011011101110000001011111100000100
1011110101011011011100000010011111000011000
1111110101011010111100001100011111101010011
1111110101011011111110101001111110001100000
1111100101011011111100110000011101010000000
1111000101011111111101000001011011100000010
1110000101010011111110000010010111000000100

and the transpose of $P^{(11)}$ is given by

1100001110000010000000000000010000111000001
1010001001100010000000000000001000100110001
1001000101011000000000000000000100010101100
1000100010010110000000000000000010001001011
1000010000101110000000000000000001000010111
1100001110000011000011100000100000000000000
1010001001100010100010011000100000000000000
1001000101011000010001010110000000000000000
1000100010010110001000100101100000000000000
1000010000101110000100001011100000000000000
1000000000000001111100000000011111000000000